CS109B: Advanced Topics in Data Science
Harvard University
Spring 2020
Instructors: Pavlos Protopapas, Mark Glickman and Chris Tanner
Project Advisors: Douglas Finkbeiner and Jun Yin
Vast amounts of data: As astronomers collect more and more image data, there is a need for further development of automated, reliable, and fast analysis methods.
Traditional methods: For example, to describe a galaxy, software fits a parametric or non-parametric model to estimate the galaxy's brightness, shape, size, and orientation. A commonly used, though imperfect, model is the Sérsic profile, developed in 1963.
New methods: In recent years, researchers have started exploring the application of deep learning methods to analyze astronomical data. For instance, classification of galaxies using CNNs has demonstrated high levels of accuracy.
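For reference, the Sérsic profile named under traditional methods has the standard form I(r) = I_e exp(-b_n [(r / r_e)^(1/n) - 1]). Below is a minimal sketch of rendering such a galaxy on a pixel grid. The function names, the default parameter values, and the ellipticity and angle parameterization are illustrative assumptions, not the project's simulation code. The constant b_n uses the common approximation 2n - 1/3, which is rough for small n.

```python
import numpy as np


def sersic_b_n(n):
    """Approximate b_n so that r_eff encloses about half of the total light."""
    return 2.0 * n - 1.0 / 3.0


def sersic_profile(r, amplitude, r_eff, n):
    """Surface brightness I(r) = I_e * exp(-b_n * ((r / r_e)**(1/n) - 1))."""
    b_n = sersic_b_n(n)
    return amplitude * np.exp(-b_n * ((r / r_eff) ** (1.0 / n) - 1.0))


def render_galaxy(size=64, amplitude=1.0, r_eff=8.0, n=2.0, ellip=0.3, theta=0.5):
    """Render an elliptical Sérsic galaxy centered on a size x size pixel grid.

    All defaults are hypothetical values chosen for illustration.
    """
    y, x = np.mgrid[:size, :size].astype(float)
    center = (size - 1) / 2.0
    dx, dy = x - center, y - center
    # Rotate pixel offsets into the galaxy's major/minor axis frame.
    x_major = dx * np.cos(theta) + dy * np.sin(theta)
    x_minor = -dx * np.sin(theta) + dy * np.cos(theta)
    axis_ratio = 1.0 - ellip
    # Elliptical radius: the minor axis is stretched by the axis ratio.
    r = np.sqrt(x_major**2 + (x_minor / axis_ratio) ** 2)
    return sersic_profile(r, amplitude, r_eff, n)


image = render_galaxy()
```

Larger values of n give a more centrally concentrated profile with more extended wings, which is one reason the Sérsic index can be hard to recover from noisy images.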
We generate three datasets with increasing complexity for parameter estimation:
| Dataset | Number of Samples | PSF | Gaussian Noise Level | Signal-to-Noise Ratio |
|---|---|---|---|---|
| 1. | 200,000 | 0.5 | 200 | From 10 to 100 |
| 2. | 200,000 | 0.5 | From 200 to 400 | From 10 to 100 |
| 3. | 200,000 | From 0.5 to 1.0 | From 200 to 400 | From 10 to 100 |
Each dataset contains 200,000 simulated observations: 180,000 for training and 20,000 for validation.
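A minimal sketch of the noise and split steps implied by the table above. It assumes that "Gaussian Noise Level" is the standard deviation of zero-mean pixel noise and that ranges such as "From 200 to 400" are sampled uniformly; the poster states neither, and all function names here are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)


def sample_noise_level(low, high, rng):
    """Draw a noise level uniformly from [low, high] (assumed distribution).

    Passing low == high reproduces a fixed noise level, as in dataset 1.
    """
    return float(rng.uniform(low, high))


def add_gaussian_noise(image, sigma, rng):
    """Return a copy of the image with zero-mean Gaussian noise of std sigma."""
    return image + rng.normal(0.0, sigma, size=image.shape)


def train_val_split(n_samples, n_val, rng):
    """Shuffle sample indices and split them into training and validation sets."""
    indices = rng.permutation(n_samples)
    return indices[n_val:], indices[:n_val]


# 200,000 samples per dataset: 180,000 for training, 20,000 for validation.
train_idx, val_idx = train_val_split(200_000, 20_000, rng)

# Example: one noisy image at a level drawn from the range used in datasets 2 and 3.
clean = np.zeros((64, 64))
noisy = add_gaussian_noise(clean, sample_noise_level(200, 400, rng), rng)
```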
We estimate the following five parameters from image data:
Interactive app: To understand the relationship between galaxy parameters and the resulting images, we created a web app at https://measure-galaxies.herokuapp.com.
Autoencoders: Have two potential advantages:
Neural architecture search: We run AutoKeras, an AutoML tool, to quickly test vanilla CNNs, ResNets, and Xception networks with different complexities, regularization, and normalization parameters.
Grid search of hyperparameters: We pick several key hyperparameters of the best model, expand their range, and evaluate the effect using a small portion of the data.
Denoising pipeline: Informed by a large gap between performance metrics for noiseless and noisy data, we test a two-stage pipeline described in Madireddy (2019) that uses a separate denoising network as the first step.
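The grid search step above could look roughly like the following sketch: enumerate every combination of a few hyperparameters and score each one on a small random subset of the data. The hyperparameter names, the subset fraction, and the `toy_evaluate` function are hypothetical placeholders; in practice the evaluation would train the selected model briefly and return a validation loss.

```python
import itertools
import random


def grid_search(param_grid, evaluate, data, subset_fraction=0.1, seed=0):
    """Score every hyperparameter combination on a random subset of the data.

    `data` must be a sequence (e.g. a list of sample indices). Returns a list
    of (loss, params) pairs sorted from best (lowest loss) to worst.
    """
    rng = random.Random(seed)
    n_subset = max(1, int(len(data) * subset_fraction))
    subset = rng.sample(data, n_subset)

    names = list(param_grid)
    results = []
    for values in itertools.product(*(param_grid[name] for name in names)):
        params = dict(zip(names, values))
        results.append((evaluate(params, subset), params))
    results.sort(key=lambda item: item[0])
    return results


def toy_evaluate(params, subset):
    """Placeholder for 'train on the subset and return validation loss'."""
    return (params["learning_rate"] - 1e-3) ** 2 + 0.01 * abs(params["dropout"] - 0.25)


param_grid = {
    "learning_rate": [1e-4, 1e-3, 1e-2],  # hypothetical ranges
    "dropout": [0.0, 0.25, 0.5],
}
results = grid_search(param_grid, toy_evaluate, data=list(range(1000)))
best_loss, best_params = results[0]
```

Evaluating on a subset trades accuracy of the ranking for speed, so the top few configurations are usually worth re-checking on the full training set.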
Grad-CAM and Saliency maps: examining these may provide insights for improving the estimation of parameters like Sérsic Index and Sérsic Radius.
Attention layers: focusing a network's attention on particular regions of the image might prove useful for estimating galaxy parameters.
Additional training: can help improve the performance of some of our models.
Ensuring robustness: by attaching uncertainty estimates to our point estimates, using Bayesian or frequentist methods, we can help ensure that the models are used safely on out-of-distribution data.
Real galaxy images: training and testing on real data will demonstrate whether our proof-of-concept approach is applicable to actual astrophysical tasks.
Source code available at: github.com/dvukolov/cs109b-project